We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording. All our models are trained in a self-supervised manner on an already-processed wet multitrack dataset, with an effective data preprocessing method that alleviates the data scarcity of obtaining unprocessed dry data. We analyze the proposed encoder for its ability to disentangle audio effects and also validate its performance for mixing style transfer through both objective and subjective evaluations. The results show that the proposed system not only converts the mixing style of multitrack audio close to that of a reference but is also robust in mixture-wise style transfer when combined with a music source separation model.
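The abstract does not specify the contrastive objective used to pre-train the effects encoder. As a rough, hypothetical illustration of the general idea, the following NumPy sketch implements an InfoNCE-style loss in which each embedding is pulled toward its designated positive (e.g., a differently-excerpted clip processed with the same effects, an assumption not stated in the abstract) and pushed away from the other items in the batch:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of embedding pairs.

    anchors, positives: (N, D) arrays; row i of `positives` is the
    positive example for row i of `anchors`, and all other rows in the
    batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (matched pairs) as the true class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With matched anchor/positive pairs the loss approaches zero, while mismatched pairs yield a loss near log N, which is the behavior a contrastive pre-training stage exploits.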
Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is typically estimated by a pyramid network and then leveraged to guide a synthesis network to generate intermediate frames between input frames. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages the estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement of both the optical flow and the intermediate frame. In particular, we show that our iterative synthesis significantly improves the robustness of frame interpolation on large-motion cases. Despite being extremely lightweight (1.7M parameters), UPR-Net achieves excellent performance on a large range of benchmarks. Code will be available soon.
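The abstract's core structural idea, reusing the same lightweight recurrent module at every pyramid level and refining the upsampled estimate from the coarser level, can be sketched as follows. The `estimator` callable is a hypothetical stand-in for the paper's recurrent flow module; only the coarse-to-fine recurrence itself is illustrated:

```python
import numpy as np

def downsample(img):
    """2x average-pool downsample (assumes even height and width)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid_recurrent_estimate(frame0, frame1, levels, estimator):
    """Coarse-to-fine recurrence, sketched: the *same* estimator is
    reused at every pyramid level, refining the flow upsampled from
    the coarser level below. `estimator(f0, f1, flow_init) -> flow`.
    """
    pyr0, pyr1 = [frame0], [frame1]
    for _ in range(levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))
    flow = np.zeros(pyr0[-1].shape + (2,))  # init at the coarsest level
    for f0, f1 in zip(reversed(pyr0), reversed(pyr1)):
        if flow.shape[:2] != f0.shape:
            # Upsample flow 2x and double its magnitude for the finer grid
            flow = 2.0 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)
        flow = estimator(f0, f1, flow)
    return flow
```

Sharing one module across levels is what keeps the parameter count small while still allowing an arbitrary number of refinement iterations.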
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupies most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has been significantly reduced. As a result, the proportion of the execution time spent in other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead of the batch normalization process. Backed by this thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. All of these approximation techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for on-device training.
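Of the three approximation techniques, range batch normalization is the easiest to illustrate in isolation. Following the published range-BN idea (Banner et al., 2018, an assumption here, since the abstract gives no formulas), the per-channel standard deviation is replaced by the value range scaled by C(n) = 1/sqrt(2 ln n), which tracks the expected std of n Gaussian samples and avoids the costly variance reduction; the low-precision and block-floating-point parts of LightNorm are omitted:

```python
import numpy as np

def range_batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Range batch normalization, sketched: normalize each channel by
    its value range scaled by 1/sqrt(2*ln(n)) instead of by the
    standard deviation.

    x: (N, C) activations, normalized per channel (axis 0).
    """
    n = x.shape[0]
    mu = x.mean(axis=0)
    value_range = x.max(axis=0) - x.min(axis=0)
    # Scaled range approximates the std of n Gaussian samples
    scale = value_range / np.sqrt(2.0 * np.log(n))
    return gamma * (x - mu) / (scale + eps) + beta
```

Because max/min reductions are much cheaper in hardware than a sum-of-squares, this substitution is what makes the normalization amenable to a compact on-chip implementation.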
Drowsiness on the road is a widespread problem with fatal consequences; thus, a multitude of systems and techniques have been proposed. Among existing methods, Ghoddoosian et al. utilized temporal blinking patterns to detect early signs of drowsiness, but their algorithm was tested only on a powerful desktop computer, which is not practical to apply in a moving vehicle setting. In this paper, we propose an efficient platform to run Ghoddoosian's algorithm, detail the performance tests we ran to determine this platform, and explain our threshold optimization logic. After considering the Jetson Nano and the Beelink (Mini PC), we concluded that the Mini PC is the most efficient and practical option to run our embedded system in a vehicle. To determine this, we ran communication speed tests and evaluated total processing times for inference operations. Based on our experiments, the average total processing time to run the drowsiness detection model was 94.27 ms for the Jetson Nano and 22.73 ms for the Beelink (Mini PC). Considering the portability and power efficiency of each device, along with the processing time results, the Beelink (Mini PC) was determined to be the most suitable. We also propose a threshold optimization algorithm, which determines whether the driver is drowsy or alert based on the trade-off between the sensitivity and specificity of the drowsiness detection model. Our study will serve as a crucial next step for drowsiness detection research and its application in vehicles. Through our experiments, we have determined a favorable platform that can run drowsiness detection algorithms in real time and can be used as a foundation to further advance drowsiness detection research. In doing so, we have bridged the gap between an existing embedded system and its actual implementation in vehicles, bringing drowsiness detection technology a step closer to prevalent real-life deployment.
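The abstract does not state how the sensitivity/specificity trade-off is resolved; one standard way to pick an operating point, shown below purely as a hypothetical illustration (the paper's actual optimization may differ), is to scan candidate thresholds and maximize Youden's J statistic, J = sensitivity + specificity - 1:

```python
def choose_threshold(scores, labels, thresholds):
    """Pick the score threshold maximizing Youden's J statistic.

    scores: model drowsiness scores; labels: 1 = drowsy, 0 = alert.
    Returns the best threshold and its J value.
    """
    best_t, best_j = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sens = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
        spec = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

In a safety-critical setting one might instead weight sensitivity more heavily than specificity, since missing a drowsy driver is costlier than a false alarm.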
In semiconductor manufacturing, wafer map defect patterns provide critical information for facility maintenance and yield management, so the classification of defect patterns is one of the most important tasks in the manufacturing process. In this paper, we propose a novel way to represent the shape of a defect pattern as a finite-dimensional vector, which will be used as input to a neural network classification algorithm. The main idea is to extract topological features of each pattern using the persistent homology theory of topological data analysis (TDA). Through some experiments with a simulated dataset, we show that the proposed method is faster and trains more efficiently than the approach using convolutional neural networks (CNNs), the most common method for wafer map defect pattern classification. Moreover, our method outperforms the CNN-based approach when the amount of training data is insufficient and imbalanced.
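Full persistent homology requires dedicated TDA machinery, but the flavor of "topological features as a finite-dimensional vector" can be conveyed with a drastically simplified sketch: count connected components (Betti-0) of the wafer map's sublevel sets across a few thresholds. This is only an illustrative stand-in for the persistence features the paper computes, not the paper's method:

```python
def betti0(binary):
    """Count connected components (Betti-0) of a 2D boolean grid,
    using iterative DFS with 4-connectivity."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if binary[i][j] and not seen[i][j]:
                count += 1
                stack = [(i, j)]
                seen[i][j] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count

def topo_feature_vector(wafer_map, thresholds):
    """Vectorize a real-valued defect map as Betti-0 counts across a
    threshold filtration -- a toy analogue of a persistence feature."""
    return [betti0([[v >= t for v in row] for row in wafer_map])
            for t in thresholds]
```

The resulting fixed-length vector can be fed directly to a small fully-connected classifier, which is why such representations train faster than a CNN operating on the raw image.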
In this paper, we study the problem of string-based molecular generation via variational autoencoders (VAEs), which have served as a popular generative approach for various tasks in artificial intelligence. We propose a simple yet effective idea to improve the task performance of VAEs. Our main idea is to maintain multiple decoders while sharing a single encoder, i.e., it is an ensemble technique. Here, we first find that naively ensembling the decoders may not be effective, because the bias of the ensembled decoders increases severely under their autoregressive inference. To maintain both a small bias and a small variance of the ensemble model, our proposed technique is two-fold: (a) sampling a different latent variable for each decoder (from the estimated mean and variance provided by the shared encoder) to encourage diverse characteristics among the decoders, and (b) a collaborative loss used during training to control the aggregated quality of the decoders given the different latent variables. In our experiments, the proposed VAE model performs particularly well in generating samples from out-of-domain distributions.
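The two-fold idea above, per-decoder latent sampling from a shared posterior plus a loss that also scores the aggregate, can be sketched as follows. The decoders here are plain callables and the losses are mean-squared errors; both are placeholder choices, since the actual model uses autoregressive string decoders:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) via the reparameterization trick."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def ensemble_decode(mu, log_var, decoders, rng):
    """Idea (a): draw a *different* latent sample from the shared
    encoder's posterior for each decoder, encouraging diversified
    decoder behavior. `decoders` is a list of callables z -> x_hat."""
    return [dec(reparameterize(mu, log_var, rng)) for dec in decoders]

def collaborative_loss(outputs, target):
    """Idea (b), sketched: penalize each decoder's own reconstruction
    error *and* the error of the ensemble average, so the aggregated
    output stays controlled."""
    individual = np.mean([np.mean((o - target) ** 2) for o in outputs])
    aggregated = np.mean((np.mean(outputs, axis=0) - target) ** 2)
    return individual + aggregated
```

Summing the individual and aggregated terms is one simple way to trade bias against variance; the paper's exact weighting of the collaborative term is not specified in the abstract.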
We investigate the problems of model estimation and reward-free learning in episodic block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function (the mapping from observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretic lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We then study the problem of learning near-optimal policies in the reward-free framework. Based on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Interestingly, our analysis provides necessary and sufficient conditions under which exploiting the block structure improves the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.
We present a novel, simple yet effective algorithm for motion-based video frame interpolation. Existing motion-based interpolation methods typically rely on a pre-trained optical flow model or a U-Net-based pyramid network for motion estimation, which either suffer from large model size or limited capacity in handling complex and large-motion cases. In this work, by carefully integrating intermediate-oriented forward warping, lightweight feature encoders, and correlation volumes into a pyramid recurrent framework, we derive a compact model that simultaneously estimates the bi-directional motion between input frames. It is 15 times smaller than PWC-Net, yet enables more reliable and flexible handling of challenging motion cases. Based on the estimated bi-directional motion, we forward-warp the input frames and their context features to the intermediate frame, and employ a synthesis network to estimate the intermediate frame from the warped representations. Our method achieves excellent performance on a broad range of video frame interpolation benchmarks. Code will be available soon.
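Forward warping, i.e., splatting each source pixel along its scaled motion vector toward the intermediate time, can be sketched minimally as below. This nearest-neighbor splat with uniform weights is only an illustration of the operation; practical interpolation models use differentiable softmax splatting with learned weights:

```python
import numpy as np

def forward_warp(frame, flow, t=0.5):
    """Forward-warp (splat) `frame` toward time t along its flow.

    frame: (H, W) image; flow: (H, W, 2) per-pixel (dy, dx) motion
    from this frame to the other input frame. Colliding pixels are
    averaged; unreached pixels are left as holes (zeros).
    """
    h, w = frame.shape
    warped = np.zeros((h, w), dtype=float)
    weight = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            # Scale the motion by t to land at the intermediate time
            ty = int(round(y + t * flow[y, x, 0]))
            tx = int(round(x + t * flow[y, x, 1]))
            if 0 <= ty < h and 0 <= tx < w:
                warped[ty, tx] += frame[y, x]
                weight[ty, tx] += 1.0
    return np.where(weight > 0, warped / np.maximum(weight, 1.0), 0.0)
```

The holes and collisions that this operation inevitably produces are precisely why a learned synthesis network is applied on top of the warped representations.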
Understanding how the events described or shown in multimedia content relate to one another is a critical component of developing robust artificially intelligent systems that can be applied to real-world media. While much research has been devoted to event understanding in the text, image, and video domains, none has explored the complex relations that events exhibit across domains. For example, a news article may describe a "protest" event while a video shows an "arrest" event. Recognizing that the visual "arrest" event is a subevent of the broader "protest" event is a challenging yet important problem that prior work has not explored. In this paper, we propose the novel task of multimodal event relation recognition to identify such cross-modal event relations. We contribute a large-scale dataset consisting of 100K video-news article pairs, as well as a benchmark of densely annotated data. We also propose a weakly supervised multimodal method that integrates commonsense knowledge from an external knowledge base (KB) to predict rich multimodal event hierarchies. Experiments show that our model outperforms a number of competitive baselines on our proposed benchmark. We also perform a detailed analysis of our model's performance and suggest directions for future research.
Histopathology remains the gold standard for the diagnosis of various cancers. Recent advances in computer vision, specifically deep learning, have facilitated the analysis of histopathology images for various tasks, including immune cell detection and microsatellite instability classification. The state-of-the-art for each task often employs base architectures that have been pretrained for image classification. The standard approach to developing histopathology classifiers tends to focus on optimizing a model for a single task, not considering the aspects of modeling innovations that improve generalization across tasks. Here we present ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolkit): an extensible, fully reproducible benchmarking toolkit that consists of a broad collection of patch-level image classification tasks across different cancers. ChampKit enables a way to systematically document the performance impact of proposed improvements in models and methods. The ChampKit source code and data are freely accessible at https://github.com/kaczmarj/champkit .